knitr::opts_chunk$set(
echo = TRUE,
dev = "png",
dpi = 150,
fig.align = "center",
comment = NA
)
library(rstan)
## Loading required package: StanHeaders
## Loading required package: ggplot2
## rstan (Version 2.21.2, GitRev: 2e1f913d3ca3)
## For execution on a local, multicore CPU with excess RAM we recommend calling
## options(mc.cores = parallel::detectCores()).
## To avoid recompilation of unchanged Stan programs, we recommend calling
## rstan_options(auto_write = TRUE)
## Do not specify '-march=native' in 'LOCAL_CPPFLAGS' or a Makevars file
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
library(bayesplot)
## This is bayesplot version 1.8.1
## - Online documentation and vignettes at mc-stan.org/bayesplot
## - bayesplot theme set to bayesplot::theme_default()
## * Does _not_ affect other ggplot2 plots
## * See ?bayesplot_theme_set for details on theme setting
theme_set(bayesplot::theme_default())
# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)
Imagine that you are a statistician or data scientist working as an independent contractor. One of your clients is a company that owns many residential buildings throughout New York City. The property manager explains that they are concerned about the number of cockroach complaints that they receive from their buildings. Previously the company has offered monthly visits from a pest inspector as a solution to this problem. While this is the default solution of many property managers in NYC, the tenants are rarely home when the inspector visits, and so the manager reasons that this is a relatively expensive solution that is currently not very effective.
One alternative is to deploy long-term bait stations. In this alternative, child- and pet-safe bait stations are installed throughout the apartment building. Cockroaches obtain fast-acting poison from these stations and distribute it throughout the colony. The manufacturer of these bait stations provides some indication of the space-to-bait efficacy, but the manager suspects that this guidance was not calculated with NYC roaches in mind. NYC roaches, the manager rationalizes, have more hustle than traditional roaches; and NYC buildings are built differently than other common residential buildings in the US. This is particularly important because the unit cost of each bait station per year is quite high.
A subset of the company’s buildings has been randomly selected for an experiment.
We will model the number of complaints as a function of the number of traps and other predictors, and we will judge candidate models using posterior predictive diagnostics.
The data provided to us is in a file called pest_data.RDS. Let’s load the data and see what the structure is:
pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
'data.frame': 120 obs. of 14 variables:
$ mus : num 0.369 0.359 0.282 0.129 0.452 ...
$ building_id : int 37 37 37 37 37 37 37 37 37 37 ...
$ wk_ind : int 1 2 3 4 5 6 7 8 9 10 ...
$ date : Date, format: "2017-01-15" "2017-02-14" ...
$ traps : num 8 8 9 10 11 11 10 10 9 9 ...
$ floors : num 8 8 8 8 8 8 8 8 8 8 ...
$ sq_footage_p_floor : num 5149 5149 5149 5149 5149 ...
$ live_in_super : num 0 0 0 0 0 0 0 0 0 0 ...
$ monthly_average_rent: num 3847 3847 3847 3847 3847 ...
$ average_tenant_age : num 53.9 53.9 53.9 53.9 53.9 ...
$ age_of_building : num 47 47 47 47 47 47 47 47 47 47 ...
$ total_sq_foot : num 41192 41192 41192 41192 41192 ...
$ month : num 1 2 3 4 5 6 7 8 9 10 ...
$ complaints : num 1 3 0 1 0 0 4 3 2 2 ...
We have access to the following fields:
complaints: Number of complaints per building per month
building_id: The unique building identifier
traps: The number of traps used per month per building
date: The date at which the number of complaints are recorded
live_in_super: An indicator for whether the building has a live-in super
age_of_building: The age of the building
total_sq_foot: The total square footage of the building
average_tenant_age: The average age of the tenants per building
monthly_average_rent: The average monthly rent per building
floors: The number of floors per building

First, let’s see how many buildings we have data for:
N_buildings <- length(unique(pest_data$building_id))
N_buildings
[1] 10
And make some plots of the raw data:
ggplot(pest_data, aes(x = complaints)) +
geom_bar()
ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) +
geom_jitter()
ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) +
geom_line(aes(linetype = "Number of complaints")) +
geom_point(color = "black") +
geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) +
facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) +
scale_x_date(name = "Month", date_labels = "%b") +
scale_y_continuous(name = "", limits = range(pest_data$complaints)) +
scale_linetype_discrete(name = "") +
scale_color_discrete(name = "Live-in super")
The first question we’ll look at is just whether the number of complaints per building per month is associated with the number of bait stations per building per month, ignoring the temporal and across-building variation (we’ll come back to those sources of variation later in the document). That requires only two variables, \(\textrm{complaints}\) and \(\textrm{traps}\). How should we model the number of complaints?
We already know some rudimentary information about what we should expect. The number of complaints over a month should be either zero or an integer. The property manager tells us that it is possible but unlikely that number of complaints in a given month is zero. Occasionally there are a very large number of complaints in a single month. A common way of modeling this sort of skewed, single bounded count data is as a Poisson random variable. One concern about modeling the outcome variable as Poisson is that the data may be over-dispersed, but we’ll start with the Poisson model and then check whether over-dispersion is a problem by comparing our model’s predictions to the data.
Given that we have chosen a Poisson regression, we define the likelihood to be the Poisson probability mass function over the number of complaints, conditional on the number of bait stations placed in the building, denoted below as traps. This model assumes that the mean and variance of the outcome variable complaints (number of complaints) are equal. We’ll investigate whether this is a good assumption after we fit the model.
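As a quick illustration of that equal mean/variance property, here is a small simulation (not part of the analysis itself):

```r
# For Poisson draws the sample mean and variance should be close;
# over-dispersed counts would instead show variance > mean.
set.seed(42)
x <- rpois(1e5, lambda = 4)
mean(x)  # approximately 4
var(x)   # also approximately 4
```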
For building \(b = 1,\dots,10\) at time (month) \(t = 1,\dots,12\), we have
\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]
Let’s encode this probability model in a Stan program.
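The contents of stan_programs/simple_poisson_regression.stan are not reproduced in this document; a minimal sketch consistent with the model above could look like the following (the priors shown here are illustrative assumptions, not necessarily those in the actual file):

```stan
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector<lower=0>[N] traps;
}
parameters {
  real alpha;  // intercept on the log scale
  real beta;   // coefficient on traps
}
model {
  // weakly informative priors (assumed)
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  // poisson_log(eta) is Poisson(exp(eta))
  complaints ~ poisson_log(alpha + beta * traps);
}
generated quantities {
  // replicated data for posterior predictive checks
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]);
}
```

The generated quantities block is what produces the y_rep draws used in the posterior predictive checks later on.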
Now we can fit the model by calling the sampling() function.
To fit the model to the actual observed data we’ll first create a list to pass to Stan using the variables in the pest_data data frame:
stan_dat_simple <- list(
N = nrow(pest_data),
complaints = pest_data$complaints,
traps = pest_data$traps
)
str(stan_dat_simple)
List of 3
$ N : int 120
$ complaints: num [1:120] 1 3 0 1 0 0 4 3 2 2 ...
$ traps : num [1:120] 8 8 9 10 11 11 10 10 9 9 ...
We compile the Stan program and then sample from the posterior.
comp_model_P <- stan_model('stan_programs/simple_poisson_regression.stan')
fit_P_real_data <- sampling(comp_model_P, data = stan_dat_simple)
SAMPLING FOR MODEL 'simple_poisson_regression' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.133 seconds (Warm-up)
Chain 1: 0.149 seconds (Sampling)
Chain 1: 0.282 seconds (Total)
Chain 1:
Now let’s print the parameter summaries. What do these tell us?
print(fit_P_real_data, pars = c('alpha','beta'))
Inference for Stan model: simple_poisson_regression.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
alpha 2.58 0.01 0.16 2.28 2.48 2.58 2.69 2.90 686 1
beta -0.19 0.00 0.02 -0.24 -0.21 -0.19 -0.18 -0.15 711 1
Samples were drawn using NUTS(diag_e) at Fri Dec 03 15:02:40 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
We can also plot the posterior distributions:
mcmc_hist(as.matrix(fit_P_real_data, pars = c('alpha','beta')))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
mcmc_scatter(as.matrix(fit_P_real_data, pars = c('alpha','beta')), alpha = 0.2)
As we expected, it appears the number of bait stations set in a building is associated with the number of complaints about cockroaches that were made in the following month. However, we still need to consider how well the model fits.
y_rep <- as.matrix(fit_P_real_data, pars = "y_rep")
ppc_dens_overlay(y = stan_dat_simple$complaints, y_rep[1:200,])
The simulated datasets are not as dispersed as the observed data and don’t seem to capture the rate of zeros in the observed data. The Poisson model may not be sufficient for these data.
Let’s explore this further by looking directly at the proportion of zeros in the real data and predicted data.
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The plot above shows the observed proportion of zeros (thick vertical line) and a histogram of the proportion of zeros in each of the simulated data sets. It is clear that the model does not capture this feature of the data well at all.
This next plot is a plot of the standardised residuals of the observed vs predicted number of complaints.
mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
As you can see here, it looks as though we have more positive residuals than negative, which indicates that the model tends to underestimate the number of complaints that will be received.
We can also view how the predicted number of complaints varies with the number of traps. From this we can see that the model doesn’t seem to fully capture the data.
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
Specifically, the model doesn’t capture the tails of the observed data very well.
Modeling the relationship between complaints and bait stations is the simplest model. We can expand the model, however, in a few ways that will be beneficial for our client. Moreover, the manager has told us that they expect there are a number of other reasons that one building might have more roach complaints than another.
Currently, our model’s mean parameter is a rate of complaints per 30 days, but we’re modeling a process that occurs over an area as well as over time. We have the square footage of each building, so if we add that information into the model, we can interpret our parameters as a rate of complaints per square foot per 30 days.
\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} \end{align*} \]
The term \(\text{sq_foot}\) is called an exposure term. If we log the term, we can put it in \(\eta_{b,t}\):
\[ \begin{align*} \textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\ \lambda_{b,t} & = \exp{(\eta_{b,t} )} \\ \eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b \end{align*} \]
A quick test shows us that there appears to be a relationship between the square footage of the building and the number of complaints received:
ggplot(pest_data, aes(x = log(total_sq_foot), y = log1p(complaints))) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
`geom_smooth()` using formula 'y ~ x'
Using the property manager’s intuition, we include two extra pieces of information we know about the building, the (log of the) total square footage and whether or not there is a live-in super, in the data we pass to Stan.
stan_dat_simple$log_sq_foot <- log(pest_data$total_sq_foot/1e4)
stan_dat_simple$live_in_super <- pest_data$live_in_super
Now we need a new Stan model that uses multiple predictors.
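The multiple-regression program isn’t shown here either; a sketch of how it could extend the simple model, with log_sq_foot entering as an offset (coefficient fixed at 1) and priors again being illustrative assumptions, is:

```stan
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector<lower=0>[N] traps;
  vector<lower=0,upper=1>[N] live_in_super;
  vector[N] log_sq_foot;  // exposure term, enters eta with coefficient 1
}
parameters {
  real alpha;
  real beta;
  real beta_super;
}
model {
  // weakly informative priors (assumed)
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  complaints ~ poisson_log(alpha + beta * traps
                           + beta_super * live_in_super
                           + log_sq_foot);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = poisson_log_rng(alpha + beta * traps[n]
                               + beta_super * live_in_super[n]
                               + log_sq_foot[n]);
}
```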
And then compile and fit the model we wrote for the multiple regression.
Now let’s use the real data and explore the fit.
comp_model_P_mult <- stan_model('stan_programs/multiple_poisson_regression.stan')
fit_model_P_mult_real <- sampling(comp_model_P_mult, data = stan_dat_simple)
SAMPLING FOR MODEL 'multiple_poisson_regression' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.352 seconds (Warm-up)
Chain 1: 0.347 seconds (Sampling)
Chain 1: 0.699 seconds (Total)
Chain 1:
y_rep <- as.matrix(fit_model_P_mult_real, pars = "y_rep")
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
Again it looks like we haven’t captured the smaller counts very well, nor the larger counts.
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero", binwidth = 0.01)
We’re still severely underestimating the proportion of zeros in the data. Ideally this vertical line would fall somewhere within the histogram.
We can also plot uncertainty intervals for the predicted complaints for different numbers of traps.
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
We can see that we’ve increased the tails a bit more at the larger numbers of traps but we still have some large observed numbers of complaints that the model would consider extremely unlikely events.
When we modeled the data using a Poisson distribution, we saw that the model didn’t fit the data as well as we would like. In particular, the model underpredicted low and high numbers of complaints and overpredicted medium numbers of complaints. This is one indication of over-dispersion, where the variance is larger than the mean. A Poisson model doesn’t fit over-dispersed count data very well because the same parameter, \(\lambda\), controls both the expected counts and the variance of those counts. The natural alternative is the negative binomial model:
\[ \begin{align*} \text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} & = \exp{(\eta_{b,t})} \\ \eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b} \end{align*} \]
In Stan the negative binomial mass function we’ll use is called \(\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)\). Like the poisson_log function, it is parameterized in terms of its log-mean, \(\eta\), but it also has a precision parameter \(\phi\) such that
\[ \mathbb{E}[y] \, = \lambda = \exp(\eta) \]
\[ \text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi. \]
As \(\phi\) gets larger the term \(\lambda^2 / \phi\) approaches zero and so the variance of the negative-binomial approaches \(\lambda\), i.e., the negative-binomial gets closer and closer to the Poisson.
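A sketch of how the negative binomial version could be written in Stan, assuming the same data block as the multiple Poisson regression above (putting the prior on inv_phi = 1/\(\phi\) rather than on \(\phi\) directly is a common choice and is assumed here, as are the other priors):

```stan
parameters {
  real alpha;
  real beta;
  real beta_super;
  real<lower=0> inv_phi;  // 1/phi; as inv_phi -> 0 we recover the Poisson
}
transformed parameters {
  real phi = inv(inv_phi);
}
model {
  alpha ~ normal(log(4), 1);
  beta ~ normal(-0.25, 1);
  beta_super ~ normal(-0.5, 1);
  inv_phi ~ normal(0, 1);  // half-normal given the lower bound
  complaints ~ neg_binomial_2_log(alpha + beta * traps
                                  + beta_super * live_in_super
                                  + log_sq_foot, phi);
}
generated quantities {
  int y_rep[N];
  for (n in 1:N)
    y_rep[n] = neg_binomial_2_log_rng(alpha + beta * traps[n]
                                      + beta_super * live_in_super[n]
                                      + log_sq_foot[n], phi);
}
```

Working with inv_phi also makes the Poisson limit easy to reason about: prior mass near inv_phi = 0 corresponds to models close to the Poisson.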
comp_model_NB <- stan_model('stan_programs/multiple_NB_regression.stan')
fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_simple)
SAMPLING FOR MODEL 'multiple_NB_regression' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 1.212 seconds (Warm-up)
Chain 1: 1.102 seconds (Sampling)
Chain 1: 2.314 seconds (Total)
Chain 1:
samps_NB <- rstan::extract(fitted_model_NB)
Let’s look at our predictions vs. the data.
y_rep <- samps_NB$y_rep
ppc_dens_overlay(stan_dat_simple$complaints, y_rep[1:200,])
It appears that our model now captures both the smaller counts and the tails of the distribution better.
Let’s check if the negative binomial model does a better job capturing the number of zeros:
ppc_stat(y = stan_dat_simple$complaints, yrep = y_rep, stat = "prop_zero")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
These look OK, but let’s look at the standardized residual plot.
mean_inv_phi <- mean(samps_NB$inv_phi)
mean_y_rep <- colMeans(y_rep)
std_resid <- (stan_dat_simple$complaints - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
Looks OK, but we still have some very large standardized residuals. This might be because we are currently ignoring the fact that the data are clustered by building, and that the probability of roach issues may vary substantially across buildings.
Check predictions by number of traps:
ppc_intervals(
y = stan_dat_simple$complaints,
yrep = y_rep,
x = stan_dat_simple$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
We haven’t used the fact that the data are clustered by building yet. A posterior predictive check might elucidate whether it would be a good idea to add the building information into the model.
ppc_stat_grouped(
y = stan_dat_simple$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.2
)
We’re getting plausible predictions for most building means but some are estimated better than others and some have larger uncertainties than we might expect. If we explicitly model the variation across buildings we may be able to get much better estimates.
Let’s add a hierarchical intercept parameter, \(\alpha_b\) at the building level to our model.
\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})} \\ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \beta_{\rm super}\, {\rm super}_b + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha, \sigma_{\mu}) \]
In our Stan model, \(\mu_b\) is the \(b\)-th element of the vector \(\texttt{mu}\) which has one element per building.
One of our predictors varies only by building, so we can rewrite the above model more efficiently like so:
\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps}_{b,t} + \text{log_sq_foot}_b\\ \mu_b \sim \text{Normal}(\alpha + \beta_{\text{super}} \, \text{super}_b , \sigma_{\mu}) \]
We have more information at the building level as well, like the average age of the residents, the average age of the buildings, and the average per-apartment monthly rent, so we can add that data into a matrix called building_data, which will have one row per building and four columns:

live_in_super
age_of_building
average_tenant_age
monthly_average_rent

We’ll write the Stan model like:
\[ \eta_{b,t} = \mu_b + \beta \, {\rm traps} + \text{log_sq_foot}\\ \mu \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \,\sigma_{\mu}) \]
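A sketch of the corresponding Stan code for this centered hierarchical model (priors are illustrative assumptions, and the actual file may differ in details):

```stan
data {
  int<lower=1> N;
  int<lower=0> complaints[N];
  vector<lower=0>[N] traps;
  vector[N] log_sq_foot;
  int<lower=1> J;                        // number of buildings
  int<lower=1> K;                        // number of building-level predictors
  int<lower=1,upper=J> building_idx[N];  // building for each observation
  matrix[J,K] building_data;
}
parameters {
  real alpha;
  vector[K] zeta;          // coefficients on building-level data
  real<lower=0> sigma_mu;
  vector[J] mu;            // building-specific intercepts
  real beta;
  real<lower=0> inv_phi;
}
model {
  mu ~ normal(alpha + building_data * zeta, sigma_mu);
  alpha ~ normal(log(4), 1);
  zeta ~ normal(0, 1);
  sigma_mu ~ normal(0, 1);
  beta ~ normal(-0.25, 1);
  inv_phi ~ normal(0, 1);
  complaints ~ neg_binomial_2_log(mu[building_idx] + beta * traps
                                  + log_sq_foot, inv(inv_phi));
}
```

Note that indexing the vector mu with the integer array building_idx expands it to one intercept per observation.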
We’ll need to do some more data prep before we can fit our models. First, to use the building variable in Stan we need to transform it from a factor variable to an integer variable.
N_months <- length(unique(pest_data$date))
# Add some IDs for building and month
pest_data <- pest_data %>%
mutate(
building_fac = factor(building_id, levels = unique(building_id)),
building_idx = as.integer(building_fac),
ids = rep(1:N_months, N_buildings),
mo_idx = lubridate::month(date)
)
# Center and rescale the building specific data
building_data <- pest_data %>%
select(
building_idx,
live_in_super,
age_of_building,
total_sq_foot,
average_tenant_age,
monthly_average_rent
) %>%
unique() %>%
arrange(building_idx) %>%
select(-building_idx) %>%
scale(scale=FALSE) %>%
as.data.frame() %>%
mutate( # scale by constants
age_of_building = age_of_building / 10,
total_sq_foot = total_sq_foot / 10000,
average_tenant_age = average_tenant_age / 10,
monthly_average_rent = monthly_average_rent / 1000
) %>%
as.matrix()
# Make data list for Stan
stan_dat_hier <-
with(pest_data,
list(complaints = complaints,
traps = traps,
N = length(traps),
J = N_buildings,
M = N_months,
log_sq_foot = log(pest_data$total_sq_foot/1e4),
building_data = building_data[,-3],
mo_idx = as.integer(as.factor(date)),
K = 4,
building_idx = building_idx
)
)
Let’s compile the model.
comp_model_NB_hier <- stan_model('stan_programs/hier_NB_regression.stan')
Fit the model to data.
fitted_model_NB_hier <-
sampling(
comp_model_NB_hier,
data = stan_dat_hier,
chains = 4,
cores = 4,
iter = 4000
)
Warning: There were 633 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
Warning: Bulk Effective Samples Size (ESS) is too low, indicating posterior means and medians may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#bulk-ess
Warning: Tail Effective Samples Size (ESS) is too low, indicating posterior variances and tail quantiles may be unreliable.
Running the chains for more iterations may help. See
http://mc-stan.org/misc/warnings.html#tail-ess
We get a bunch of warnings from Stan about divergent transitions, which is an indication that there may be regions of the posterior that have not been explored by the Markov chains.
Divergences are discussed in more detail in the course slides as well as the bayesplot [MCMC diagnostics vignette](http://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html) and A Conceptual Introduction to Hamiltonian Monte Carlo.
In this example we will see that we have divergent transitions because we need to reparameterize our model - i.e., we will retain the overall structure of the model, but transform some of the parameters so that it is easier for Stan to sample from the parameter space. Before we go through exactly how to do this reparameterization, we will first look at the diagnostics that indicate reparameterization is likely to resolve the problem.
First let’s extract the fits from the model.
samps_hier_NB <- rstan::extract(fitted_model_NB_hier)
Then we print the fits for the parameters that are of most interest.
print(fitted_model_NB_hier, pars = c('sigma_mu','beta','alpha','phi','mu'))
Inference for Stan model: hier_NB_regression.
4 chains, each with iter=4000; warmup=2000; thin=1;
post-warmup draws per chain=2000, total post-warmup draws=8000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
sigma_mu 0.26 0.01 0.17 0.06 0.13 0.22 0.34 0.68 568 1.01
beta -0.23 0.00 0.06 -0.35 -0.27 -0.23 -0.19 -0.11 431 1.00
alpha 1.24 0.02 0.44 0.41 0.93 1.22 1.54 2.10 442 1.00
phi 1.58 0.01 0.35 1.04 1.32 1.53 1.77 2.40 1565 1.00
mu[1] 1.25 0.02 0.55 0.16 0.87 1.23 1.63 2.33 510 1.01
mu[2] 1.21 0.02 0.54 0.17 0.84 1.17 1.58 2.26 550 1.01
mu[3] 1.39 0.02 0.50 0.47 1.04 1.38 1.73 2.36 449 1.00
mu[4] 1.42 0.02 0.49 0.50 1.08 1.41 1.75 2.37 544 1.00
mu[5] 1.06 0.02 0.42 0.25 0.77 1.06 1.35 1.90 626 1.00
mu[6] 1.15 0.02 0.50 0.21 0.81 1.14 1.48 2.10 403 1.01
mu[7] 1.43 0.02 0.52 0.40 1.09 1.43 1.77 2.44 568 1.00
mu[8] 1.24 0.02 0.44 0.37 0.94 1.24 1.53 2.08 635 1.00
mu[9] 1.38 0.03 0.57 0.27 1.00 1.38 1.78 2.49 519 1.00
mu[10] 0.85 0.02 0.37 0.18 0.59 0.83 1.09 1.62 543 1.01
Samples were drawn using NUTS(diag_e) at Fri Dec 03 15:21:54 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
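To build intuition for the Rhat column above, here is the basic potential scale reduction factor computed from scratch; note this is a simplified illustration, not rstan’s exact implementation, which additionally splits each chain in half:

```r
# Basic (non-split) Rhat: compares between-chain to within-chain variance
rhat_basic <- function(draws) {          # draws: iterations x chains matrix
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))        # within-chain variance
  B <- n * var(colMeans(draws))          # between-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_plus / W)
}
set.seed(1)
good_chains <- matrix(rnorm(4000), ncol = 4)                 # well-mixed chains
bad_chains  <- good_chains + rep(c(0, 0, 0, 3), each = 1000) # one shifted chain
rhat_basic(good_chains)  # close to 1
rhat_basic(bad_chains)   # well above 1
```

When chains explore the same distribution, between-chain and within-chain variance agree and Rhat is near 1; a stuck or shifted chain inflates the ratio.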
You can see that the effective sample sizes are quite low for many of the parameters relative to the total number of post-warmup draws. This alone isn’t indicative of the need to reparameterize, but it indicates that we should look further at the trace plots and pairs plots. First let’s look at the trace plots to see if the divergent transitions form a pattern.
# use as.array to keep the markov chains separate for trace plots
mcmc_trace(
as.array(fitted_model_NB_hier,pars = 'sigma_mu'),
np = nuts_params(fitted_model_NB_hier),
window = c(500,1000)
)
It looks as if the divergent transitions, the little red bars underneath the trace plots, correspond to samples where the sampler gets stuck at one value of \(\sigma_\mu\).
# assign to object so we can compare to another plot later
scatter_with_divs <- mcmc_scatter(
as.array(fitted_model_NB_hier),
pars = c("mu[4]", 'sigma_mu'),
transform = list('sigma_mu' = "log"),
np = nuts_params(fitted_model_NB_hier)
)
scatter_with_divs
What we have here is a cloud-like shape, with most of the divergences clustering towards the bottom.
Another way to look at the divergences is via a parallel coordinates plot:
parcoord_with_divs <- mcmc_parcoord(
as.array(fitted_model_NB_hier, pars = c("sigma_mu", "mu")),
np = nuts_params(fitted_model_NB_hier)
)
parcoord_with_divs
Again, we see evidence that our problems concentrate when \(\texttt{sigma_mu}\) is small.
We can reparameterize the random intercept \(\mu_b\), which is distributed:
\[ \mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu}) \]
Instead of sampling \(\mu_b\) directly, we should use the non-centered parameterization. We define a vector of auxiliary variables \(\texttt{mu_raw}\) in the parameters block that is given a \(\text{Normal}(0, 1)\) prior in the model block. We then make \(\texttt{mu}\) a transformed parameter:
transformed parameters {
vector[J] mu;
mu = alpha + building_data * zeta + sigma_mu * mu_raw;
}
This gives \(\texttt{mu}\) a \(\text{Normal}(\alpha + \texttt{building_data}\, \zeta, \sigma_\mu)\) distribution, but it decouples the dependence of the density of each element of \(\texttt{mu}\) from \(\texttt{sigma_mu}\) (\(\sigma_\mu\)). hier_NB_regression_ncp.stan uses the non-centered parameterization for \(\texttt{mu}\). We will examine the effective sample size of the fitted model to see whether we’ve fixed the problem with our reparameterization.
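The algebra behind the non-centered parameterization is easy to verify by simulation: shifting and scaling a standard normal draw yields the same distribution as sampling from the centered normal directly. A minimal sketch with toy scalar values for \(\alpha\) and \(\sigma_\mu\) (ignoring the building-level predictors):

```r
set.seed(1)
alpha <- 1.2; sigma_mu <- 0.3  # toy hyperparameter values
n <- 1e5
mu_centered <- rnorm(n, mean = alpha, sd = sigma_mu)  # centered: draw mu directly
mu_raw <- rnorm(n, 0, 1)                              # auxiliary Normal(0, 1) draw
mu_noncentered <- alpha + sigma_mu * mu_raw           # shift and scale
c(mean(mu_centered), mean(mu_noncentered))  # both close to alpha
c(sd(mu_centered), sd(mu_noncentered))      # both close to sigma_mu
```

The two parameterizations describe the same distribution, but the non-centered version breaks the prior dependence between \(\texttt{mu}\) and \(\texttt{sigma_mu}\) that produces the funnel geometry HMC struggles with.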
Compile the model.
comp_model_NB_hier_ncp <- stan_model('stan_programs/hier_NB_regression_ncp.stan')
Fit the model to the data.
fitted_model_NB_hier_ncp <- sampling(comp_model_NB_hier_ncp, data = stan_dat_hier, chains = 4, cores = 4)
Let’s examine the fit of the new model.
print(fitted_model_NB_hier_ncp, pars = c('sigma_mu','beta','alpha','phi','mu'))
Inference for Stan model: hier_NB_regression_ncp.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
sigma_mu 0.23 0.00 0.17 0.01 0.10 0.19 0.32 0.62 1425 1
beta -0.23 0.00 0.06 -0.35 -0.27 -0.23 -0.19 -0.11 2964 1
alpha 1.27 0.01 0.43 0.46 0.98 1.26 1.56 2.14 3017 1
phi 1.58 0.01 0.34 1.02 1.34 1.54 1.77 2.35 4081 1
mu[1] 1.30 0.01 0.55 0.21 0.93 1.30 1.67 2.37 2948 1
mu[2] 1.24 0.01 0.53 0.21 0.89 1.22 1.59 2.30 3098 1
mu[3] 1.40 0.01 0.49 0.48 1.07 1.40 1.73 2.41 3177 1
mu[4] 1.44 0.01 0.50 0.53 1.10 1.43 1.78 2.44 3070 1
mu[5] 1.09 0.01 0.42 0.29 0.81 1.08 1.36 1.97 3378 1
mu[6] 1.20 0.01 0.48 0.27 0.87 1.19 1.52 2.18 3055 1
mu[7] 1.47 0.01 0.52 0.49 1.12 1.46 1.81 2.51 3242 1
mu[8] 1.25 0.01 0.43 0.40 0.96 1.25 1.54 2.10 3478 1
mu[9] 1.44 0.01 0.57 0.30 1.06 1.43 1.83 2.51 3166 1
mu[10] 0.87 0.01 0.37 0.17 0.62 0.86 1.11 1.64 3872 1
Samples were drawn using NUTS(diag_e) at Fri Dec 03 15:25:16 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
This has improved the effective sample sizes of \(\texttt{mu}\). We extract the parameters to run our usual posterior predictive checks.
scatter_no_divs <- mcmc_scatter(
as.array(fitted_model_NB_hier_ncp),
pars = c("mu[4]", 'sigma_mu'),
transform = list('sigma_mu' = "log"),
np = nuts_params(fitted_model_NB_hier_ncp)
)
bayesplot_grid(scatter_with_divs, scatter_no_divs,
grid_args = list(ncol = 2), ylim = c(-11, 1))
Warning: Removed 1 rows containing missing values (geom_point).
parcoord_no_divs <- mcmc_parcoord(
as.array(fitted_model_NB_hier_ncp, pars = c("sigma_mu", "mu")),
np = nuts_params(fitted_model_NB_hier_ncp)
)
bayesplot_grid(parcoord_with_divs, parcoord_no_divs,
ylim = c(-3, 3))
samps_NB_hier_ncp <- rstan::extract(fitted_model_NB_hier_ncp, pars = c('y_rep','inv_phi'))
The marginal plot, again.
y_rep <- as.matrix(fitted_model_NB_hier_ncp, pars = "y_rep")
ppc_dens_overlay(stan_dat_hier$complaints, y_rep[1:200,])
This looks quite nice. If we’ve captured the building-level means well, then the posterior distribution of means by building should match well with the observed mean number of complaints for each building.
ppc_stat_grouped(
y = stan_dat_hier$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.5
)
We weren’t doing terribly with the building-specific means before, but now they are all estimated a bit better by the model. The model is also able to do a decent job estimating the probability of zero complaints:
prop_zero <- function(x) mean(x == 0)
ppc_stat(
y = stan_dat_hier$complaints,
yrep = y_rep,
stat = prop_zero,
binwidth = 0.025
)
# plot separately for each building
ppc_stat_grouped(
y = stan_dat_hier$complaints,
yrep = y_rep,
group = pest_data$building_id,
stat = prop_zero,
binwidth = 0.025
)
Predictions by number of traps:
ppc_intervals(
y = stan_dat_hier$complaints,
yrep = y_rep,
x = stan_dat_hier$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
Standardized residuals:
mean_y_rep <- colMeans(y_rep)
mean_inv_phi <- mean(as.matrix(fitted_model_NB_hier_ncp, pars = "inv_phi"))
std_resid <- (stan_dat_hier$complaints - mean_y_rep) / sqrt(mean_y_rep + mean_y_rep^2*mean_inv_phi)
qplot(mean_y_rep, std_resid) + hline_at(2) + hline_at(-2)
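The denominator in the standardized residuals uses the negative binomial variance, \(\mu + \mu^2 / \phi\). We can sanity-check that identity by simulation with toy values for \(\mu\) and \(\phi\) (R’s `rnbinom()` uses `size` for \(\phi\) in this mean-dispersion parameterization):

```r
set.seed(1)
mu <- 4; phi <- 1.6                     # toy values
y <- rnbinom(1e6, mu = mu, size = phi)  # NB draws with mean mu, dispersion phi
var(y)           # empirical variance
mu + mu^2 / phi  # theoretical variance: 4 + 16/1.6 = 14
```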
We’ve gotten some new data that extends the number of time points for which we have observations for each building. This will let us explore how to expand the model a bit more with varying slopes in addition to the varying intercepts and also, later, model temporal variation.
stan_dat_hier <- readRDS('data/pest_data_longer_stan_dat.RDS')
Perhaps if the levels of complaints differ by building, the effect of traps on complaints differs by building too. We can add this to our model and observe the fit.
\[ \text{complaints}_{b,t} \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\ \lambda_{b,t} = \exp{(\eta_{b,t})}\\ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \text{log_sq_foot}_b \\ \mu_b \sim \text{Normal}(\alpha + \texttt{building_data} \, \zeta, \sigma_{\mu}) \\ \kappa_b \sim \text{Normal}(\beta + \texttt{building_data} \, \gamma, \sigma_{\kappa}) \]
Let’s compile the model.
comp_model_NB_hier_slopes <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mod.stan')
Fit the model to data and extract the posterior draws needed for our posterior predictive checks.
fitted_model_NB_hier_slopes <-
sampling(
comp_model_NB_hier_slopes,
data = stan_dat_hier,
chains = 4, cores = 4,
control = list(adapt_delta = 0.95)
)
Warning: There were 2 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
To see if the model infers building-to-building differences in these slopes, we can plot a histogram of our marginal posterior distribution for sigma_kappa.
mcmc_hist(
as.matrix(fitted_model_NB_hier_slopes, pars = "sigma_kappa"),
binwidth = 0.005
)
print(fitted_model_NB_hier_slopes, pars = c('kappa','beta','alpha','phi','sigma_mu','sigma_kappa','mu'))
Inference for Stan model: hier_NB_regression_ncp_slopes_mod.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
kappa[1] -0.02 0.00 0.07 -0.14 -0.07 -0.03 0.01 0.15 1266 1.00
kappa[2] -0.41 0.00 0.10 -0.62 -0.47 -0.40 -0.35 -0.24 1784 1.00
kappa[3] -0.59 0.00 0.10 -0.79 -0.65 -0.58 -0.52 -0.39 5262 1.00
kappa[4] -0.22 0.00 0.07 -0.36 -0.26 -0.22 -0.18 -0.09 3002 1.00
kappa[5] -0.60 0.00 0.09 -0.78 -0.66 -0.60 -0.54 -0.43 5518 1.00
kappa[6] -0.43 0.00 0.10 -0.65 -0.49 -0.43 -0.37 -0.25 3138 1.00
kappa[7] -0.31 0.00 0.07 -0.44 -0.36 -0.31 -0.27 -0.18 5379 1.00
kappa[8] -0.22 0.00 0.15 -0.55 -0.31 -0.22 -0.12 0.04 2272 1.00
kappa[9] 0.08 0.00 0.06 -0.03 0.04 0.08 0.12 0.19 5283 1.00
kappa[10] -0.73 0.00 0.15 -1.00 -0.83 -0.74 -0.64 -0.41 1758 1.00
beta -0.35 0.00 0.06 -0.46 -0.38 -0.34 -0.31 -0.23 2720 1.00
alpha 1.41 0.01 0.29 0.81 1.23 1.42 1.60 1.95 2722 1.00
phi 1.61 0.00 0.19 1.27 1.48 1.60 1.73 2.01 4554 1.00
sigma_mu 0.45 0.01 0.37 0.02 0.16 0.36 0.64 1.43 749 1.01
sigma_kappa 0.12 0.00 0.08 0.02 0.07 0.10 0.15 0.31 684 1.01
mu[1] 0.34 0.02 0.69 -1.35 0.00 0.43 0.79 1.43 1250 1.00
mu[2] 1.62 0.01 0.52 0.69 1.26 1.57 1.93 2.74 1761 1.00
mu[3] 2.12 0.00 0.32 1.51 1.90 2.12 2.34 2.79 5266 1.00
mu[4] 1.49 0.01 0.50 0.50 1.17 1.48 1.79 2.52 3166 1.00
mu[5] 2.39 0.01 0.41 1.62 2.10 2.39 2.66 3.21 6031 1.00
mu[6] 1.89 0.01 0.38 1.20 1.65 1.86 2.10 2.77 3063 1.00
mu[7] 2.69 0.00 0.26 2.21 2.51 2.68 2.86 3.22 5042 1.00
mu[8] -0.58 0.02 0.91 -2.26 -1.19 -0.60 -0.03 1.44 2446 1.00
mu[9] 0.24 0.01 0.56 -0.85 -0.13 0.24 0.61 1.33 5152 1.00
mu[10] 1.87 0.03 1.03 -0.55 1.30 2.01 2.59 3.52 1282 1.00
Samples were drawn using NUTS(diag_e) at Fri Dec 03 15:28:49 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
mcmc_hist(
as.matrix(fitted_model_NB_hier_slopes, pars = "beta"),
binwidth = 0.005
)
While the model can’t specifically rule out zero from the posterior, it does have mass at small non-zero numbers, so we should leave in the hierarchy over \(\texttt{kappa}\). Plotting the marginal data density again, we can see the model still looks well calibrated.
y_rep <- as.matrix(fitted_model_NB_hier_slopes, pars = "y_rep")
ppc_dens_overlay(
y = stan_dat_hier$complaints,
yrep = y_rep[1:200,]
)
We haven’t looked at how cockroach complaints change over time. Let’s look at whether there’s any pattern over time.
select_vec <- which(stan_dat_hier$mo_idx %in% 1:12)
ppc_stat_grouped(
y = stan_dat_hier$complaints[select_vec],
yrep = y_rep[,select_vec],
group = stan_dat_hier$mo_idx[select_vec],
stat = 'mean'
) + xlim(0, 11)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We might augment our model with a log-additive monthly effect, \(\texttt{mo}_t\).
\[ \eta_{b,t} = \mu_b + \kappa_b \, \texttt{traps}_{b,t} + \texttt{mo}_t + \text{log_sq_foot}_b \]
We have complete freedom over how to specify the prior for \(\texttt{mo}_t\). There are several competing factors for how the number of complaints might change over time. It makes sense that there might be more roaches in the environment during the summer, but we might also expect that there is more roach control in the summer as well. Given that we’re modeling complaints, maybe after the first sighting of roaches in a building, residents are more vigilant, and thus complaints of roaches would increase.
This can be a motivation for using an autoregressive prior for our monthly effects. The model is as follows:
\[ \texttt{mo}_t \sim \text{Normal}(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}) \\ \equiv \\ \texttt{mo}_t = \rho \, \texttt{mo}_{t-1} +\epsilon_t , \quad \epsilon_t \sim \text{Normal}(0, \sigma_\texttt{mo}) \\ \quad \rho \in [-1,1] \]
This equation says that the monthly effect in month \(t\) is directly related to the last month’s monthly effect. Given the description of the process above, it seems like there could be either positive or negative associations between the months, but there should be a bit more weight placed on positive \(\rho\)s, so we’ll put an informative prior that pushes the parameter \(\rho\) towards 0.5.
Before we write our prior, however, we have a problem: Stan doesn’t implement any densities that have support on \([-1,1]\). We can use a variable transformation of a raw variable defined on \([0,1]\) to give us a density on \([-1,1]\). Specifically,
\[ \rho_{\text{raw}} \in [0, 1] \\ \rho = 2 \times \rho_{\text{raw}} - 1 \]
Then we can put a beta prior on \(\rho_\text{raw}\) to push our estimate towards 0.5.
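The transform is easy to check by simulation; here we use the \(\text{Beta}(10, 5)\) shape that appears later when we visualize this prior:

```r
set.seed(1)
rho_raw <- rbeta(1e5, 10, 5)  # Beta prior draws on [0, 1]
rho <- 2 * rho_raw - 1        # shift and scale to [-1, 1]
range(rho)                    # stays inside [-1, 1]
mean(rho)                     # prior mean is 2 * (10/15) - 1 = 1/3
```

A Beta(10, 5) on the raw scale puts most of its mass above 1/2, so the implied prior on \(\rho\) favors positive autocorrelation, as desired.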
One further wrinkle is that we have a prior for \(\texttt{mo}_t\) that depends on \(\texttt{mo}_{t-1}\). That is, we are working with the conditional distribution of \(\texttt{mo}_t\) given \(\texttt{mo}_{t-1}\). But what should we do about the prior for \(\texttt{mo}_1\), for which we don’t have a previous time period in the data?
We need to work out the marginal distribution of the first observation. Thankfully we can use the fact that AR models are stationary, so \(\text{Var}(\texttt{mo}_t) = \text{Var}(\texttt{mo}_{t-1})\) and \(\mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\texttt{mo}_{t-1})\) for all \(t\). Therefore the marginal distribution of \(\texttt{mo}_1\) is the same as the marginal distribution of any \(\texttt{mo}_t\).
First we derive the marginal variance of \(\texttt{mo}_{t}\).
\[ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \, \texttt{mo}_{t-1} + \epsilon_t) \\ \text{Var}(\texttt{mo}_t) = \text{Var}(\rho \, \texttt{mo}_{t-1}) + \text{Var}(\epsilon_t) \] where the second line holds by independence of \(\epsilon_t\) and \(\texttt{mo}_{t-1}\). Then, using the fact that \(\text{Var}(cX) = c^2\text{Var}(X)\) for a constant \(c\) and the fact that, by stationarity, \(\text{Var}(\texttt{mo}_{t-1}) = \text{Var}(\texttt{mo}_{t})\), we then obtain:
\[ \text{Var}(\texttt{mo}_t)= \rho^2 \text{Var}( \texttt{mo}_{t}) + \sigma_\texttt{mo}^2 \\ \text{Var}(\texttt{mo}_t) = \frac{\sigma_\texttt{mo}^2}{1 - \rho^2} \]
For the mean of \(\texttt{mo}_t\) things are a bit simpler:
\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1} + \epsilon_t) \\ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + \mathbb{E}(\epsilon_t) \\ \]
Since \(\mathbb{E}(\epsilon_t) = 0\) by assumption we have
\[ \mathbb{E}(\texttt{mo}_t) = \mathbb{E}(\rho \, \texttt{mo}_{t-1}) + 0\\ \mathbb{E}(\texttt{mo}_t) = \rho \, \mathbb{E}(\texttt{mo}_{t}) \\ \mathbb{E}(\texttt{mo}_t) - \rho \mathbb{E}(\texttt{mo}_t) = 0 \\ \mathbb{E}(\texttt{mo}_t) = 0/(1 - \rho) \]
which for \(\rho \neq 1\) yields \(\mathbb{E}(\texttt{mo}_{t}) = 0\).
We now have the marginal distribution for \(\texttt{mo}_{t}\), which, in our case, we will use for \(\texttt{mo}_1\). The full AR(1) specification is then:
\[ \texttt{mo}_1 \sim \text{Normal}\left(0, \frac{\sigma_\texttt{mo}}{\sqrt{1 - \rho^2}}\right) \\ \texttt{mo}_t \sim \text{Normal}\left(\rho \, \texttt{mo}_{t-1}, \sigma_\texttt{mo}\right) \quad \forall \, t > 1 \]
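We can check this derivation by simulating a long AR(1) series with the stationary initialization and comparing the empirical variance to \(\sigma_\texttt{mo}^2 / (1 - \rho^2)\), using toy values for \(\rho\) and \(\sigma_\texttt{mo}\):

```r
set.seed(1)
rho <- 0.7; sigma_mo <- 0.5  # toy values
n_t <- 1e5
mo <- numeric(n_t)
mo[1] <- rnorm(1, 0, sigma_mo / sqrt(1 - rho^2))  # marginal (stationary) draw
for (t in 2:n_t) mo[t] <- rnorm(1, rho * mo[t - 1], sigma_mo)
c(empirical = var(mo), theoretical = sigma_mo^2 / (1 - rho^2))
```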
comp_model_NB_hier_mos <- stan_model('stan_programs/hier_NB_regression_ncp_slopes_mod_mos.stan')
fitted_model_NB_hier_mos <- sampling(comp_model_NB_hier_mos, data = stan_dat_hier, chains = 4, cores = 4, control = list(adapt_delta = 0.9))
Warning: There were 8 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
In the interest of brevity, we won’t go on expanding the model, though we certainly could. What other information would help us understand the data generating process better? What other aspects of the data generating process might we want to capture that we’re not capturing now?
As usual, we run through our posterior predictive checks.
y_rep <- as.matrix(fitted_model_NB_hier_mos, pars = "y_rep")
ppc_dens_overlay(
y = stan_dat_hier$complaints,
yrep = y_rep[1:200,]
)
select_vec <- which(stan_dat_hier$mo_idx %in% 1:12)
ppc_stat_grouped(
y = stan_dat_hier$complaints[select_vec],
yrep = y_rep[,select_vec],
group = stan_dat_hier$mo_idx[select_vec],
stat = 'mean'
)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As we can see, our monthly random intercept has captured a monthly pattern across all the buildings. We can also compare the prior and posterior for the autoregressive parameter to see how much we’ve learned. Here are two different ways of comparing the prior and posterior visually:
# 1) compare draws from prior and draws from posterior
rho_draws <- cbind(
2 * rbeta(4000, 10, 5) - 1, # draw from prior
as.matrix(fitted_model_NB_hier_mos, pars = "rho")
)
colnames(rho_draws) <- c("prior", "posterior")
mcmc_hist(rho_draws, freq = FALSE, binwidth = 0.025,
facet_args = list(nrow = 2)) + xlim(-1, 1)
# 2) overlay prior density curve on posterior draws
gen_rho_prior <- function(x) {
alpha <- 10; beta <- 5
a <- -1; c <- 1
lp <- (alpha - 1) * log(x - a) +
(beta - 1) * log(c - x) -
(alpha + beta - 1) * log(c - a) -
lbeta(alpha, beta)
return(exp(lp))
}
mcmc_hist(as.matrix(fitted_model_NB_hier_mos, pars = "rho"),
freq = FALSE, binwidth = 0.01) +
overlay_function(fun = gen_rho_prior) +
xlim(-1,1)
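As a sanity check on the density used in the overlay, the shifted-and-scaled Beta density should integrate to 1 over \([-1, 1]\); the function is repeated here so the snippet is self-contained:

```r
# Density of rho = 2 * rho_raw - 1 with rho_raw ~ Beta(10, 5),
# i.e., a Beta density rescaled to the interval [a, c] = [-1, 1]
gen_rho_prior <- function(x) {
  alpha <- 10; beta <- 5
  a <- -1; c <- 1
  lp <- (alpha - 1) * log(x - a) +
    (beta - 1) * log(c - x) -
    (alpha + beta - 1) * log(c - a) -
    lbeta(alpha, beta)
  exp(lp)
}
integrate(gen_rho_prior, lower = -1, upper = 1)$value  # ~ 1
```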
print(fitted_model_NB_hier_mos, pars = c('rho','sigma_mu','sigma_kappa','gamma'))
Inference for Stan model: hier_NB_regression_ncp_slopes_mod_mos.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
rho 0.78 0.00 0.08 0.59 0.73 0.78 0.83 0.91 1604 1
sigma_mu 0.31 0.01 0.23 0.01 0.12 0.26 0.44 0.86 1591 1
sigma_kappa 0.09 0.00 0.06 0.01 0.05 0.08 0.11 0.22 903 1
gamma[1] -0.19 0.00 0.11 -0.40 -0.25 -0.19 -0.12 0.02 2009 1
gamma[2] 0.11 0.00 0.08 -0.03 0.07 0.11 0.16 0.27 1602 1
gamma[3] 0.10 0.00 0.15 -0.20 0.01 0.10 0.19 0.39 2540 1
gamma[4] 0.00 0.00 0.07 -0.14 -0.04 0.00 0.04 0.12 1368 1
Samples were drawn using NUTS(diag_e) at Fri Dec 03 15:30:49 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
ppc_intervals(
y = stan_dat_hier$complaints,
yrep = y_rep,
x = stan_dat_hier$traps
) +
labs(x = "Number of traps", y = "Number of complaints")
It looks as if our model finally generates a reasonable posterior predictive distribution for all numbers of traps, and appropriately captures the tails of the data generating process.
We now use the LOOIC (leave-one-out information criterion) and the WAIC (Watanabe-Akaike information criterion) from the loo package to compare models. The model yielding the lowest LOOIC/WAIC is the best in terms of predictive accuracy for out-of-sample predictions. We use here the extended dataset and re-fit the models.
library(loo)
# data (longer dataset)
pest_data <- readRDS('data/pest_data_longer_stan_dat.RDS')
str(pest_data)
## ----describe-data-------------------------------------------------------
N_buildings <- length(unique(pest_data$building_idx))
N_buildings
## Data acquisition
stan_dat_simple <- list(
N = length(pest_data$complaints),
complaints = pest_data$complaints,
traps = pest_data$traps
)
## Models fit
# simple poisson
comp_model_P <-
stan_model('stan_programs/simple_poisson_regression.stan')
fit_P_real_data <- sampling(comp_model_P, data = stan_dat_simple)
# multiple poisson
stan_dat_simple$log_sq_foot <- pest_data$log_sq_foot
stan_dat_simple$live_in_super <- pest_data$pred_mat[,"live_in_super"]
comp_model_P_mult <-
stan_model('stan_programs/multiple_poisson_regression.stan')
fit_model_P_mult_real <- sampling(comp_model_P_mult, data = stan_dat_simple)
# negative binomial
comp_model_NB <-
stan_model('stan_programs/multiple_NB_regression.stan')
fitted_model_NB <- sampling(comp_model_NB, data = stan_dat_simple)
# -----------------------
stan_dat_hier <- pest_data
# hier NB regression
comp_model_NB_hier <-
stan_model('stan_programs/hier_NB_regression.stan')
fitted_model_NB_hier <-
sampling(
comp_model_NB_hier,
data = stan_dat_hier,
chains = 4,
cores = 4,
iter = 4000
)
# hier NCP NB regression
comp_model_NB_hier_ncp <-
stan_model('stan_programs/hier_NB_regression_ncp.stan')
fitted_model_NB_hier_ncp <- sampling(comp_model_NB_hier_ncp,
data = stan_dat_hier,
chains = 4,
cores = 4,
iter=4000)
# hier NB slopes
comp_model_NB_hier_slopes <-
stan_model('stan_programs/hier_NB_regression_ncp_slopes_mod.stan')
fitted_model_NB_hier_slopes <-
sampling(
comp_model_NB_hier_slopes,
data = stan_dat_hier,
chains = 4, cores = 4,
control = list(adapt_delta = 0.95))
Warning: There were 2 divergent transitions after warmup. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
to find out why this is a problem and how to eliminate them.
Warning: Examine the pairs() plot to diagnose sampling problems
## Extract pointwise log-likelihood and
## compute loo and waic
# loo
log_lik_pois <- extract_log_lik(fit_P_real_data)
loo_pois <- loo(log_lik_pois)
Warning: Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and
MCSE estimates will be over-optimistic.
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
waic_pois <- waic(log_lik_pois)
Warning:
11 (3.1%) p_waic estimates greater than 0.4. We recommend trying loo instead.
log_lik_mult_pois <- extract_log_lik(fit_model_P_mult_real)
loo_mult_pois <- loo(log_lik_mult_pois)
Warning: Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and
MCSE estimates will be over-optimistic.
Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.
waic_mult_pois <- waic(log_lik_mult_pois)
Warning:
19 (5.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
log_lik_mult_NB <- extract_log_lik(fitted_model_NB)
loo_mult_NB <- loo(log_lik_mult_NB)
Warning: Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and
MCSE estimates will be over-optimistic.
waic_mult_NB <- waic(log_lik_mult_NB)
Warning:
1 (0.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
log_lik_hier_NB <- extract_log_lik(fitted_model_NB_hier)
loo_hier_NB <- loo(log_lik_hier_NB)
Warning: Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and
MCSE estimates will be over-optimistic.
waic_hier_NB <- waic(log_lik_hier_NB)
Warning:
1 (0.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
log_lik_hier_NB_ncp <- extract_log_lik(fitted_model_NB_hier_ncp)
loo_hier_NB_ncp <- loo(log_lik_hier_NB_ncp)
Warning: Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and
MCSE estimates will be over-optimistic.
waic_hier_NB_ncp <- waic(log_lik_hier_NB_ncp)
Warning:
1 (0.3%) p_waic estimates greater than 0.4. We recommend trying loo instead.
log_lik_slopes <- extract_log_lik(fitted_model_NB_hier_slopes)
loo_slopes <- loo(log_lik_slopes)
Warning: Relative effective sample sizes ('r_eff' argument) not specified.
For models fit with MCMC, the reported PSIS effective sample sizes and
MCSE estimates will be over-optimistic.
Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.
waic_slopes <- waic(log_lik_slopes)
Warning:
4 (1.1%) p_waic estimates greater than 0.4. We recommend trying loo instead.
compare(loo_pois, loo_mult_pois,
loo_mult_NB, loo_hier_NB,
loo_hier_NB_ncp, loo_slopes)
Warning: 'compare' is deprecated.
Use 'loo_compare' instead.
See help("Deprecated")
looic1<-loo_pois$estimates[3,1]
looic2<-loo_mult_pois$estimates[3,1]
looic3<-loo_mult_NB$estimates[3,1]
looic4<-loo_hier_NB$estimates[3,1]
looic5<-loo_hier_NB_ncp$estimates[3,1]
looic6<-loo_slopes$estimates[3,1]
looics <-c(looic1, looic2, looic3, looic4, looic5,
looic6)
waic1<-waic_pois$estimates[3,1]
waic2<-waic_mult_pois$estimates[3,1]
waic3<-waic_mult_NB$estimates[3,1]
waic4<-waic_hier_NB$estimates[3,1]
waic5<-waic_hier_NB_ncp$estimates[3,1]
waic6<-waic_slopes$estimates[3,1]
waics <-c(waic1, waic2, waic3, waic4, waic5,
waic6)
par(xaxt="n", mfrow=c(1,2))
plot(looics, type="b", xlab="", ylab="LOOIC")
par(xaxt="s")
axis(1, c(1:6), c("Pois", "Pois mult", "NB",
"Hier NB", "Hier NB ncp", "NB slopes"), las=2)
par(xaxt="n")
plot(waics, type="b", xlab="", ylab="WAIC")
par(xaxt="s")
axis(1, c(1:6), c("Pois", "Pois mult", "NB",
"Hier NB", "Hier NB ncp", "NB slopes"), las=2)
The model with varying slopes appears to be the best in terms of predictive information criteria.